-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set timeout for helper subprocesses to enhance stability #11125
Conversation
77e08d4
to
132fedd
Compare
…elpers_command_timeout`
ad5d950
to
dbb517d
Compare
fix spec issues
) | ||
.returns(T.nilable(T.any(String, T::Hash[String, T.untyped], T::Array[T::Hash[String, T.untyped]]))) | ||
end | ||
def self.run_helper_subprocess(command:, function:, args:, env: nil, | ||
stderr_to_stdout: false, | ||
allow_unsafe_shell_command: false, | ||
error_class: HelperSubprocessFailed) | ||
error_class: HelperSubprocessFailed, | ||
command_type: :network, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Tip: We set the default timeout to 120 seconds (using :network
).
Note: The timeout applies to the time between outputs to stdout
or stderr
, not the overall execution duration of the command. Each time the command produces output, the timeout is reset, and it waits for the next response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH this feels more complex/confusing than it needs to be. This introduces magic symbols that callers must know to use, where those symbols are mapped to timeout values in the new command helpers class.
There's also some complexity coming from the addition of both command type and timeout arguments. As a caller if I didn't know the internals of this function I would be confused that a command that was run with timeout: -1
would have a timeout applied because command_type: :network
has a special meaning. Is timeout an explicit value? An implicit value? I need to dig into the code to know how to call this.
WDYT about creating constants for different common timeout values and having callers use those constants to set a timeout value when needed. This removes having two params, makes the timeout field explicit, and maintains named common timeout values.
module CommandTimeouts
NETWORK = 120 # seconds
LONG_RUNNING = 300 # seconds
NONE = -1 # seconds
LOCAL_TASK = 30 # seconds
DEFAULT = NETWORK
end
def self.run_helper_subprocess(... timeout: CommandTimeouts::DEFAULT...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fixed. Please let me know if you see any issue now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave myself a crash course on Ruby IO stream handling, this appears correct to me based on the limited info 20 minutes of reading can provide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable to me. Feature flagging definitely seems like the right call here. It'd be great to see an incremental rollout to make sure everything is functioning as intended.
# Use IO.select with a dynamically calculated short timeout | ||
ready_ios = IO.select(ios, nil, nil, [0.1, remaining_timeout].min) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never worked with IO.select
before, but the docs suggest the arguments should be Arrays of IO objects. What's the effect of using nil
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not monitoring for write readiness (nil
) because writing is already completed at the beginning, and we do not perform incremental writes. Similarly, we do not monitor error conditions (nil
) as they are not required in this context. For read readiness, we monitor the output streams (stdout
and stderr
) using a timeout to avoid blocking indefinitely. The timeout is set to the shorter of 0.1 seconds or the remaining timeout ([0.1, remaining_timeout].min
), ensuring responsiveness while respecting the overall timeout limit.
Process.kill("KILL", pid) # Forcefully kill if still running | ||
end | ||
rescue Errno::EPERM | ||
Dependabot.logger.error("Insufficient permissions to terminate process: #{pid}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens in this case? Will the job still terminate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In natural way if it terminates (old way) we don't do anything. If it doesn't then the operation is going to get canceled after hanging out as how it was before. But generally we should have permission to do it since we initiated the command in this thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be surprised if this ever triggers TBH. This is only ever going to terminate a sub-process, and AFAIK the parent process should always have perms to terminate a child process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I aggree.
DEFAULT_TIMEOUTS = T.let({ | ||
no_time_out: -1, # No timeout | ||
local: 30, # Local commands | ||
network: 120, # Network-dependent commands | ||
long_running: 300 # Long-running tasks (e.g., builds) | ||
}.freeze, T::Hash[T.untyped, T.untyped]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to keep these as T::Hash[T.untyped, T.untyped]
? We're controlling them and all the values suggest to me that T::Hash[Symbol, Integer]
would be the signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Will fix it.
let(:output_hang_cmd) { command_fixture("output_hang.sh") } | ||
let(:error_hang_cmd) { command_fixture("error_hang.sh") } | ||
let(:invalid_cmd) { "non_existent_command" } | ||
let(:timeout) { 2 } # Timeout for hanging commands |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one comment where I think the command api could be simplified a bit, otherwise looks alright 👍
) | ||
.returns(T.nilable(T.any(String, T::Hash[String, T.untyped], T::Array[T::Hash[String, T.untyped]]))) | ||
end | ||
def self.run_helper_subprocess(command:, function:, args:, env: nil, | ||
stderr_to_stdout: false, | ||
allow_unsafe_shell_command: false, | ||
error_class: HelperSubprocessFailed) | ||
error_class: HelperSubprocessFailed, | ||
command_type: :network, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH this feels more complex/confusing than it needs to be. This introduces magic symbols that callers must know to use, where those symbols are mapped to timeout values in the new command helpers class.
There's also some complexity coming from the addition of both command type and timeout arguments. As a caller if I didn't know the internals of this function I would be confused that a command that was run with timeout: -1
would have a timeout applied because command_type: :network
has a special meaning. Is timeout an explicit value? An implicit value? I need to dig into the code to know how to call this.
WDYT about creating constants for different common timeout values and having callers use those constants to set a timeout value when needed. This removes having two params, makes the timeout field explicit, and maintains named common timeout values.
module CommandTimeouts
NETWORK = 120 # seconds
LONG_RUNNING = 300 # seconds
NONE = -1 # seconds
LOCAL_TASK = 30 # seconds
DEFAULT = NETWORK
end
def self.run_helper_subprocess(... timeout: CommandTimeouts::DEFAULT...)
b6aef5f
to
78c2014
Compare
reduce timeout parameter complexity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thanks for the updates! LGTM when CI is happy 😄
Thanks for the thorough review! I'll merge once the CI is happy. |
What are you trying to accomplish?
This PR introduces a timeout mechanism for running helper subprocesses to ensure they do not run indefinitely, enhancing system stability and robustness. The change is gated behind the feature flag:
enable_shared_helpers_command_timeout
.What and Why:
capture3_with_timeout
method to manage subprocess execution with a configurable timeout.run_helper_subprocess
method by replacing the existingOpen3.capture3
call withcapture3_with_timeout
.Note: The timeout applies to the time between outputs to
stdout
orstderr
, not the overall execution duration of the command. Each time the command produces output, the timeout is reset, and it waits for the next response.enable_shared_helpers_command_timeout
feature flag. This ensures no impact to existing workflows unless the flag is enabled.Related Issues:
Anything you want to highlight for special attention from reviewers?
enable_shared_helpers_command_timeout
feature flag to ensure it can be toggled on/off safely.Process.kill
andProcess.wait
) to avoid zombie processes.How will you know you've accomplished your goal?
Checklist